1. Lots and Lots of Space
All the storage services can take huge amounts of data. People
have been known to store billions of rows in one table in the table
service, and to store terabytes of data in the blob service. However,
there is not an infinite amount of space available, since Microsoft owns
only so many disks. But this is not a limit that you or anyone else will
hit.
You could draw a parallel to IPv6, which was designed as a
replacement for the current system of IP addresses, IPv4. The fact is
that IPv4 is running out of IP addresses to hand out—the limit is 4.3
billion, and the number of available IP addresses is quickly
diminishing. IPv6 has a finite number of addresses as well, but the
limit will never be reached, since that finite number is incredibly
large: 3.4 × 10³⁸, or 34 followed by 37
zeros!
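That IPv6 figure falls straight out of its 128-bit address size. A quick computation (Python here purely for illustration) shows where 3.4 × 10³⁸ comes from:

```python
# IPv6 addresses are 128 bits wide, so the address space is 2^128.
ipv6_addresses = 2 ** 128
print(ipv6_addresses)           # 340282366920938463463374607431768211456
print(f"{ipv6_addresses:.1e}")  # 3.4e+38

# IPv4, by contrast, uses 32-bit addresses: about 4.3 billion in total.
ipv4_addresses = 2 ** 32
print(f"{ipv4_addresses:,}")    # 4,294,967,296
```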
The same holds true for these storage services. You need not worry
about running out of space on a file share, negotiating with vendors, or
racking up new disks. You pay only for what you use, and you can be sure
that you’ll always have more space if you need it. Note that this
represents the total data you can store. There are limits on how large
an individual blob or a table entity/row can be, but not on the number
of blobs or the number of rows you can have.
2. Distribution
All the storage services are massively distributed. This means
that, instead of having a few huge machines serving out your data, your
data is spread out over several smaller machines. These smaller machines
do have higher rates of failure than specialized storage infrastructure
such as storage area networks (SANs), but Windows Azure storage deals
with failure through software. It implements various distributed
software techniques to ensure that it stays available and reliable in
the presence of failures in one or more of the machines on which it
runs.
3. Scalability
All the storage services are scalable. However,
scalable is a loaded, often abused word. In this
context, it means your performance should stay the same, regardless of
the amount of data you have. (This statement comes with some caveats, of
course. When you learn about the table service, you’ll see how you can
influence performance through partitioning.) More importantly,
performance stays the same when load increases. If your site shows up on
the home page of Slashdot or Digg or Reddit, Windows Azure does magic
behind the scenes to ensure that the time taken to serve requests stays
the same. A single commodity machine can take only so much load, so
multiple mechanisms are at play behind the scenes, from making
multiple copies of the data to maintaining multiple levels of hot-data
caching.
4. Replication
All data is replicated multiple times. In case of a hardware
failure or data corruption in any of the replicas, there are always more
copies from which to recover data. This happens under the covers, so you
don’t need to worry about this explicitly.
5. Consistency
Several distributed storage services are eventually consistent, which means that after an
operation is performed, it may take some time (usually a few seconds)
before the data you retrieve reflects that change. Eventual consistency
usually means better scalability and performance: if you don’t need to
make changes on several nodes, you have better availability. The
downside is that it makes writing code a lot trickier, because it’s
possible to have a write operation followed by a read that doesn’t see
the results of the write you just performed.
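That write-then-stale-read anomaly is easier to see in code. Here is a toy in-memory simulation (not how any real store is implemented) where writes land on one replica and reads come from another, with a propagation delay in between:

```python
import time

class EventuallyConsistentStore:
    """Toy model: writes land on replica 0 and propagate to replica 1
    only after a delay, so a read right after a write can be stale."""
    def __init__(self, propagation_delay=0.05):
        self.replicas = [{}, {}]
        self.pending = []          # (apply_at_time, key, value)
        self.delay = propagation_delay

    def write(self, key, value):
        self.replicas[0][key] = value  # fast local write
        self.pending.append((time.time() + self.delay, key, value))

    def read(self, key):
        # Apply any pending writes whose propagation delay has elapsed.
        now = time.time()
        still_pending = []
        for apply_at, k, v in self.pending:
            if apply_at <= now:
                self.replicas[1][k] = v
            else:
                still_pending.append((apply_at, k, v))
        self.pending = still_pending
        return self.replicas[1].get(key)  # reads hit replica 1

store = EventuallyConsistentStore()
store.write("cart", ["book"])
print(store.read("cart"))      # None -- the write hasn't propagated yet
time.sleep(0.1)
print(store.read("cart"))      # ['book'] -- eventually consistent
```

The stale `None` on the first read is exactly the behavior your application code would have to tolerate against an eventually consistent store.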
Note: Don’t misinterpret this description—eventual consistency is
great in specialized scenarios. For example, Amazon’s shopping cart
service is a canonical example of an eventually consistent
application. The underlying store it writes to (Dynamo) is a
state-of-the-art distributed storage system. It lets Amazon trade
read freshness for insert/add performance: reads don't always reflect
changes instantly, but in exchange, items can be added to shopping
carts almost instantly, and, more importantly, an added item is never
lost. Amazon decided that never losing shopping cart items was worth
the trade-off of a minuscule percentage of users' shopping carts not
appearing to have all their items at all times. For more
information, read Amazon’s paper on the topic at http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf.
Windows Azure storage is not eventually consistent; it is
instantly/strongly consistent. This means that when you do an update or a
delete, the changes are instantly visible to all future API calls. The
team decided to do this because eventual consistency would have made
writing code against the storage services quite tricky, and, more
importantly, because they could achieve very good performance without
it. While full database-style transactions aren't available, a limited
form is: you can batch calls against a single partition.
Application code must ensure consistency across partitions, and across
different storage account calls.
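The one-partition restriction shapes how batching code is written: before submitting anything, you split a mixed list of operations into one batch per partition key. A minimal sketch of that bookkeeping follows; the commented-out `submit_batch` call is a hypothetical stand-in for an actual table-service request, not a real API:

```python
from collections import defaultdict

def group_by_partition(operations):
    """Table-service batches may only touch a single partition, so
    split a mixed list of operations into one batch per partition key."""
    batches = defaultdict(list)
    for op in operations:
        batches[op["PartitionKey"]].append(op)
    return dict(batches)

operations = [
    {"PartitionKey": "customers-US", "RowKey": "1", "op": "insert"},
    {"PartitionKey": "customers-EU", "RowKey": "7", "op": "update"},
    {"PartitionKey": "customers-US", "RowKey": "2", "op": "insert"},
]

for partition_key, batch in group_by_partition(operations).items():
    # submit_batch(partition_key, batch)  # hypothetical call: each
    # single-partition batch succeeds or fails as a unit; consistency
    # across partitions remains the application's job.
    print(partition_key, len(batch))
```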
6. RESTful HTTP APIs
All the storage services are exposed through a RESTful HTTP API.
You’ll learn about the building blocks of these APIs later in this
chapter. All APIs can be accessed from both inside and outside Microsoft
data centers. This is a big deal: it means
you could host your website or service in your current data center, and
pick and choose what service you want to use. For example, you could use
only blob storage, or use only the queue service, instead of having to
host your code inside Windows Azure as well. This is similar to how
several websites use Amazon’s S3 service. For example, Twitter runs code
in its own infrastructure, but uses S3 to store profile images.
Another advantage of having open RESTful APIs is that it is
trivial to build a client library in any language/platform. Microsoft
ships one in .NET, but there are bindings in Python, Ruby, and Erlang,
just to name a few. Later in this chapter, you will learn how to build a
rudimentary library to illustrate the fundamental concepts, but in most
of the storage code, you’ll be using the official Microsoft client
library. If you want to implement your own library in a different
language/environment, just work through the sample in this chapter;
you should find it easy to repeat the same steps in your chosen
environment.
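To get a feel for how thin that layer is, here is a sketch of assembling a GET request for a blob using nothing but the standard library. The account, container, and blob names are placeholders; the `x-ms-version` value is one of the early storage API versions, and private data would additionally need a SharedKey `Authorization` header (not shown):

```python
from email.utils import formatdate

def build_blob_get_request(account, container, blob):
    """Assemble the URL and headers for a GET on a blob. For a blob in
    a public container, this unsigned request is all you need; private
    data also requires a SharedKey Authorization header (not shown)."""
    url = f"http://{account}.blob.core.windows.net/{container}/{blob}"
    headers = {
        "x-ms-version": "2009-09-19",          # storage API version
        "x-ms-date": formatdate(usegmt=True),  # RFC 1123 timestamp
    }
    return url, headers

url, headers = build_blob_get_request("myaccount", "pics", "profile.jpg")
print(url)  # http://myaccount.blob.core.windows.net/pics/profile.jpg
```

Actually sending the request is one `http.client` or `urllib` call away; the point is that any language with an HTTP stack can speak this protocol.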
7. Geodistribution
When you create your storage account, you can pick in which
geographical location you want your data to reside. This is great not
only for keeping your data close to your code or your customers, but
also for spreading your data out geographically: you don't want a
natural disaster in one region to take out your only copy of some
valuable data.
8. Pay for Play
With Windows Azure storage, like the rest of Windows Azure, you
pay only for the storage you currently use, and for bandwidth
transferring data in and out of the system.
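A back-of-the-envelope bill makes the model concrete. The per-GB rates below are assumed, purely illustrative numbers (check current pricing); the shape of the calculation is the point:

```python
# Back-of-the-envelope monthly bill. Both rates are assumptions for
# illustration only, not actual Windows Azure prices.
GB_STORED_RATE = 0.15    # $/GB stored per month (assumed)
GB_TRANSFER_RATE = 0.10  # $/GB transferred (assumed)

gb_stored = 50
gb_transferred = 20

bill = gb_stored * GB_STORED_RATE + gb_transferred * GB_TRANSFER_RATE
print(f"${bill:.2f}")  # $9.50 -- and $0.00 in a month you store nothing
```

There is no fixed fee in this model: delete your data and stop transferring, and the bill drops to zero.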